-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add validation functions and fixup sanitizer build #73
Conversation
When UBSan is enabled the equivalent flag is already added to the link options and the UBSan runtime gets linked in, but when enabling ASAN the build fails because it can't find a lot of symbols. The fix is to link it if you use it.
ASAN_OPTIONS is a runtime environment variable and doesn't affect compilation (unless exported as the symbol __asan_default_options, which it isn't here). Also note that enabling detect_leaks breaks macOS since AppleClang doesn't support this (though the LLVM from brew does).
unit makes use of streamvbyte_isadetection.h which is in an internal header not in the public API.
These are useful when loading untrusted data since it's typically trivial to change the uncompressed size that's stored alongside the encoded data. Doing so can cause the decode functions to read past the end of the input data, which at best generates bad output and at worst causes crashes if it happens to read into an unmapped page.
These presumably always passed so were never printed. Also include the iteration for easier debugging.
Fantastic work! Running tests. |
The encode functions return size_ts so we should match that.
Looks like the |
All the data would be about the same value [gap-1, gap+6] and wouldn't border on a byte width meaning that all of the keys would be the same and hence wouldn't catch bugs when reading the wrong keys. Instead use the gap to space the elements apart.
AppleClang 16 doesn't appear to be clever enough to transform the loop into a branchless one, so help it out by adding a loop that it knows how to unroll. Also do the same for streamvbyte_validate_stream_0124 though I haven't tested the performance of that.
Nothing fancy, just processing 4 keys per vector. In both micro- and macro-benchmarks this performs at basically the same speed as the loop added in the previous commit (tested with clang), both of which are significantly faster than the original version.
The speed of this ended up being a big performance hit on macOS because AppleClang16 doesn't appear to be able to unroll a loop. Giving it a hand significantly improves performance and I've added a basic NEON version that gives the same performance (at least under AppleClang16). Note that I haven't added an x64 version since we see no performance impact on Linux (using GCC, clang untested but should be able to unroll the loop on x64 too). I've also changed how |
@blawrence-ont Don't worry about some tests failing in CI for technical reasons. We can fix that in some other ways. |
@blawrence-ont Could you add a tiny bit of documentation in the README file ? Only a couple of sentences would be enough. If you don't do it, we can do it later. |
Merging! |
This PR adds 2 new functions to validate encoded streams. These are useful when loading data from untrusted sources (disk, network, etc...) since it's typically trivial to change the uncompressed size that's stored alongside the
encoded data. This has caused crashes for us where the decode functions have trusted the corrupted size and ended up reading into an unmapped page.
I was split between doing it this way or adding a
streamvbyte_decode_safe()
which also took in the size of the input, but this appeared to be the simpler of the two for error handling (at the expense of having to traverse the data twice if you need to check and then decode).Also as part of this PR I've fixed up the ASAN build so that it links (which caught a leak in the tests), fixed the Makefile build (presumably not used?), and fixed a typo. See individual commits for more detail.